The Play Store data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn from it for developers to work on and capture the Android market. Each row in this dataset has values for category, rating, size and much more. In this project the data has been analysed to discover the key factors responsible for app engagement and success.
To analyse the given data (Play Store App Review Analysis) and determine the various parameters that drive apps to success.
To determine public sentiment: which kinds of apps are winning a place in users' hearts, and why?
Our EDA on Play Store apps will help developers understand the sentiments of end users, so that they can take appropriate decisions to motivate users to prefer their app.
Well-structured, formatted, and commented code is required.
Exception Handling, Production-Grade Code & Deployment-Ready Code will be a plus. Those students will be awarded additional credits.
The additional credits will have advantages over other students during Star Student selection.
[ Note: Deployment-Ready Code is defined as: the whole .ipynb notebook should be executable in one go
without a single error logged. ]
Each and every logic should have proper comments.
You may add as many charts as you want. Make sure the following format is answered for each and every chart.
# Chart visualization code [ Hints: Do the visualization in a structured way while following the "UBM" rule.
U - Univariate Analysis,
B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)
M - Multivariate Analysis ]
import numpy as np
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
ps_ar = pd.read_csv('/content/drive/MyDrive/almabetter programin asignment/EDA-2- PLAY STORE APP REVIEW ANALYSIS/Play Store Data.csv')
ps_ar
# Dataset Rows & Columns count
ps_ar.shape
(10841, 13)
# Dataset Info
ps_ar.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   App             10841 non-null  object
 1   Category        10841 non-null  object
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object
 4   Size            10841 non-null  object
 5   Installs        10841 non-null  object
 6   Type            10840 non-null  object
 7   Price           10841 non-null  object
 8   Content Rating  10840 non-null  object
 9   Genres          10841 non-null  object
 10  Last Updated    10841 non-null  object
 11  Current Ver     10833 non-null  object
 12  Android Ver     10838 non-null  object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
# Dataset Duplicate Value Count
ps_ar.drop_duplicates(inplace=True)
ps_ar.shape
# there were (10841-10358) = 483 duplicate rows and zero duplicate columns; the duplicate rows are dropped.
# Missing Values/Null Values Count
ps_ar.isna().sum()
App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64
# Visualizing the missing values
ps_ar.info()  ##-- per the isna() count above there are 1474 null values in the Rating column, 1 null each in Type and Content Rating, 8 in Current Ver, 3 in Android Ver
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   App             10841 non-null  object
 1   Category        10841 non-null  object
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object
 4   Size            10841 non-null  object
 5   Installs        10841 non-null  object
 6   Type            10840 non-null  object
 7   Price           10841 non-null  object
 8   Content Rating  10840 non-null  object
 9   Genres          10841 non-null  object
 10  Last Updated    10841 non-null  object
 11  Current Ver     10833 non-null  object
 12  Android Ver     10838 non-null  object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
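Besides info(), the missingness pattern can be visualized directly. A minimal sketch with matplotlib only, using a small hypothetical frame standing in for ps_ar:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# Hypothetical mini-frame standing in for ps_ar
df = pd.DataFrame({
    'Rating': [4.1, np.nan, 3.9, np.nan],
    'Type':   ['Free', 'Free', None, 'Paid'],
})

mask = df.isna()                      # boolean mask: True where a value is missing
plt.imshow(mask, aspect='auto', cmap='gray_r')
plt.xticks(range(len(df.columns)), df.columns, rotation=90)
plt.title('Missing values (dark = missing)')
plt.savefig('missing_values.png')

missing_counts = mask.sum()           # per-column missing counts, same as isna().sum()
```

On the real ps_ar frame the same mask makes the 1474-row gap in Rating stand out as a dark band.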
After dropping duplicates there are 10358 rows and 13 columns in the dataset (no duplicate columns or rows). Per the isna() counts there are 1474 null values in the Rating column, 1 null value each in the Type and Content Rating columns, 8 null values in Current Ver, and 3 in Android Ver. The Price column has all values 0.
# Dataset Columns
ps_ar.columns  ##-- all column names are visible here
Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
'Android Ver'],
dtype='object')
# Dataset Describe
ps_ar.describe()  #-- statistical summary of the numeric columns in the dataset
App - application name
Category - category of the application
Rating - star rating given by users
Reviews - number of user reviews
Size - application size
Installs - number of users who have installed the app worldwide
Type - type of application (free/paid)
Price - if paid, how much to pay
Content Rating - target age group of the content
Genres - genre(s) of the app
Last Updated - date the app was last updated
Current Ver - current version
Android Ver - Android version required
ps_ar.Genres.unique()
array(['Art & Design', 'Art & Design;Pretend Play',
'Art & Design;Creativity', 'Art & Design;Action & Adventure',
'Auto & Vehicles', 'Beauty', 'Books & Reference', 'Business',
'Comics', 'Comics;Creativity', 'Communication', 'Dating',
'Education', 'Education;Creativity', 'Education;Education',
'Education;Action & Adventure', 'Education;Pretend Play',
'Education;Brain Games', 'Entertainment',
'Entertainment;Music & Video', 'Entertainment;Brain Games',
'Entertainment;Creativity', 'Events', 'Finance', 'Food & Drink',
'Health & Fitness', 'House & Home', 'Libraries & Demo',
'Lifestyle', 'Lifestyle;Pretend Play',
'Adventure;Action & Adventure', 'Arcade', 'Casual', 'Card',
'Casual;Pretend Play', 'Strategy', 'Action', 'Puzzle', 'Sports',
'Music', 'Word', 'Racing', 'Casual;Creativity', 'Simulation',
'Adventure', 'Board', 'Trivia', 'Role Playing',
'Action;Action & Adventure', 'Casual;Brain Games',
'Simulation;Action & Adventure', 'Educational;Creativity',
'Puzzle;Brain Games', 'Educational;Education', 'Card;Brain Games',
'Educational;Brain Games', 'Educational;Pretend Play',
'Casual;Action & Adventure', 'Entertainment;Education',
'Casual;Education', 'Music;Music & Video', 'Arcade;Pretend Play',
'Simulation;Pretend Play', 'Puzzle;Creativity',
'Sports;Action & Adventure', 'Racing;Action & Adventure',
'Educational;Action & Adventure', 'Arcade;Action & Adventure',
'Entertainment;Action & Adventure', 'Puzzle;Action & Adventure',
'Role Playing;Action & Adventure', 'Strategy;Action & Adventure',
'Music & Audio;Music & Video', 'Health & Fitness;Education',
'Adventure;Education', 'Board;Brain Games',
'Board;Action & Adventure', 'Board;Pretend Play',
'Casual;Music & Video', 'Education;Music & Video',
'Role Playing;Pretend Play', 'Entertainment;Pretend Play',
'Video Players & Editors;Creativity', 'Card;Action & Adventure',
'Medical', 'Social', 'Shopping', 'Photography', 'Travel & Local',
'Travel & Local;Action & Adventure', 'Tools', 'Personalization',
'Productivity', 'Parenting', 'Parenting;Education',
'Parenting;Brain Games', 'Parenting;Music & Video', 'Weather',
'Video Players & Editors', 'News & Magazines', 'Maps & Navigation',
'Health & Fitness;Action & Adventure', 'Educational', 'Casino',
'Adventure;Brain Games', 'Video Players & Editors;Music & Video',
'Trivia;Education', 'Lifestyle;Education',
'Books & Reference;Creativity', 'Books & Reference;Education',
'Simulation;Education', 'Puzzle;Education',
'Role Playing;Education', 'Role Playing;Brain Games',
'Strategy;Education', 'Racing;Pretend Play',
'Communication;Creativity', 'Strategy;Creativity'], dtype=object)
# Write your code to make your dataset analysis ready.
ps_ar.replace('0',np.nan,inplace = True)
# dropping the Price and Type columns: after replacing '0' with NaN the Price column is entirely NaN, and the Type column contains 'Free' throughout.
ps_ar.drop(['Price','Type'],axis=1, inplace=True)
ps_ar
# converting the datatype of Rating from object to float.
ps_ar[['Rating']] = ps_ar[['Rating']].astype(float)
ps_ar['Reviews'] = pd.to_numeric(ps_ar['Reviews'], errors='coerce')  # Reviews mixes numeric and string entries; non-numeric ones become NaN
#imputing nan values with mean value in Rating column as it has 1474 nan values.
ps_ar['Rating'].fillna(ps_ar['Rating'].astype(float).mean(), inplace=True)
#imputing 'Varies with device' with nan values.
ps_ar.replace('Varies with device',np.nan,inplace = True)
# dropping NaN values in the Current Ver and Android Ver columns: version strings cannot be imputed with mean, median or mode. Only 8 and 3 entries were originally null; this also drops the rows whose version was 'Varies with device' (replaced by NaN above).
ps_ar.dropna(subset=['Current Ver','Android Ver'], axis=0, inplace=True)
ps_ar.shape
(9299, 11)
ps_ar.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9299 entries, 0 to 10838
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   App             9299 non-null   object
 1   Category        9299 non-null   object
 2   Rating          9299 non-null   float64
 3   Reviews         8738 non-null   object
 4   Size            9049 non-null   object
 5   Installs        9299 non-null   object
 6   Content Rating  9299 non-null   object
 7   Genres          9299 non-null   object
 8   Last Updated    9299 non-null   object
 9   Current Ver     9299 non-null   object
 10  Android Ver     9299 non-null   object
dtypes: float64(1), object(10)
memory usage: 871.8+ KB
#Finding outliers using box plot for rating.
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(x='Rating', data=ps_ar)
plt.show()
Manipulations-
In the Price column all values were 0, so the zeros were replaced with NaN and the column was dropped; however, directly dropping the Price column without the NaN replacement would also work.
Converted the datatype of Rating from object to float, because it is a number.
Converted Reviews with pd.to_numeric, as its values mix numeric and string entries; Installs can be handled the same way.
Imputed the NaN values in the Rating column with the mean value: there were 1474 NaNs, and dropping such a huge number of rows could hamper the analysis.
In the Android Ver and Current Ver columns 'Varies with device' was replaced by NaN, and those rows were later dropped along with the 8 and 3 originally-null entries; dropping them does not hamper the analysis.
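The Installs conversion mentioned above can be sketched as follows. The sample values are hypothetical but mimic the raw column, which stores strings like '10,000+':

```python
import pandas as pd

# Hypothetical sample mimicking raw 'Installs' strings from the dataset
installs_raw = pd.Series(['10,000+', '500+', '1,000,000+', 'Free'])

# Strip '+' and ',' then coerce to numbers; anything non-numeric becomes NaN
installs = pd.to_numeric(
    installs_raw.str.replace('[+,]', '', regex=True),
    errors='coerce',
)
```

Using errors='coerce' keeps the conversion from crashing on stray non-numeric rows, which can then be inspected or dropped.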
Insights found-
Now our dataframe is ready to be analysed on various parameters and in various ways.
Proper insights will be available after analysing the data and will be written in the conclusion part.
Determined the outliers, so that any unwanted data can be discarded while taking decisions.
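The box-plot outliers above can also be flagged numerically with the standard 1.5×IQR rule; a sketch on hypothetical rating values (ps_ar['Rating'] would be used in practice):

```python
import pandas as pd

# Hypothetical rating values standing in for ps_ar['Rating']
ratings = pd.Series([4.1, 4.3, 4.5, 4.0, 1.2, 4.4, 4.2])

# Interquartile range and the usual 1.5*IQR fences
q1, q3 = ratings.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = ratings[(ratings < lower) | (ratings > upper)]
```

This gives the exact rows behind the dots the box plot shows beyond its whiskers.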
# Chart - 1 visualization code
#Histogram of Category of apps
import matplotlib.pyplot as plt
import seaborn as sns
plt.hist(ps_ar['Category'], bins=20)
plt.xlabel('Category')
plt.xticks(rotation = 90)
plt.ylabel('Frequency')
plt.title('Category wise app analysis')
plt.show()
To know which category of apps is most used and liked by end users. A histogram is plotted for univariate analysis of this categorical variable.
Are there any insights that lead to negative growth? Justify with specific reason.
Yes, from the insights we get a clear idea of which categories offer more scope for doing business; for example, a gaming app has a higher probability of succeeding than an app launched in most other categories.
For any insights that lead to negative growth, further investigation is needed.
# Chart - 2 visualization code
#Bar graph for category and installs
x= ps_ar.loc[:,'Category']
y=ps_ar.loc[:,'Installs']
plt.bar(x,y)
plt.title('Category vs Installs')
plt.xlabel('Category')
plt.xticks(rotation =90)
plt.ylabel('Installs')
plt.show()
A bar plot is used for the bivariate analysis between app category and installs. From here we get insights into how many downloads each category has.
-Insights
-Highest downloads are for News & Magazines, Personalization, Productivity, Travel & Local, Family, Medical, Social, Lifestyle, Finance, Business, and Art & Design; all of these have more than 100M+ downloads.
-Lowest app downloads are for Entertainment.
-Shopping, Photography and Sports have roughly equal downloads, greater than Video Players, Weather and Parenting apps.
Are there any insights that lead to negative growth? Justify with specific reason.
Yes, the gained insights will help create a positive business impact.
No insights were found which may lead to negative growth.
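Note that the raw Installs values are strings such as '10,000+', so plotting them directly can mis-order the y-axis. Once the column is numeric, per-category totals are more reliable; a sketch on a hypothetical mini-frame with an already-cleaned Installs column:

```python
import pandas as pd

# Hypothetical mini-frame with a numeric Installs column
df = pd.DataFrame({
    'Category': ['GAME', 'TOOLS', 'GAME', 'FAMILY'],
    'Installs': [1_000_000, 500_000, 2_000_000, 100_000],
})

# Total installs per category, largest first -- ready to feed into a bar chart
totals = df.groupby('Category')['Installs'].sum().sort_values(ascending=False)
```

Aggregating before plotting also avoids the overplotting that happens when plt.bar draws one bar per row for a repeated category.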
# Chart - 3 visualization code
#Bar graph for category and ratings.
x=ps_ar.loc[:,'Category']
y=ps_ar.loc[:,'Rating']
plt.bar(x,y)
plt.title('Category vs Rating')
plt.xlabel('Category')
plt.xticks(rotation =90)
plt.ylabel('Rating')
plt.show()
A bar plot is used for the bivariate analysis between app category and ratings. From here we get insights into how each category of app is rated.
Most of the apps have a rating of 5; only 8 apps have a rating less than 5.
Are there any insights that lead to negative growth? Justify with specific reason.
no such insights found.
# Chart - 4 visualization code
# bar graph for Genres and Rating
x=ps_ar.loc[:,'Genres']
y=ps_ar.loc[:,'Rating']
plt.bar(x,y)
plt.title('Genres vs Rating')
plt.xlabel('Genres')
plt.xticks(rotation =90)
plt.ylabel('Rating')
plt.show()
To know whether there is any relationship between genres and rating.
No significant fruitful insights are obtained from the graph.
Are there any insights that lead to negative growth? Justify with specific reason.
N/A
# Chart - 5 visualization code
#scatter plot of content rating and rating
plt.scatter(ps_ar['Rating'], ps_ar['Content Rating'])
plt.xlabel('Rating')
plt.ylabel('Content Rating')
plt.title('content rating and rating')
plt.show()
To do bivariate analysis between Content Rating and Rating.
Insights-
Most end users have liked the 10+ age-group content, mostly giving at least an average rating above 3.4.
Wide variation in rating is seen in content made for every age group.
Very few or the lowest ratings are seen for 18+ or unrated content, which shows the least interest of end users in that content.
Are there any insights that lead to negative growth? Justify with specific reason.
Observing the ratings for 18+ content, it is concluded that people show the least interest in it, so making such content is not advisable.
The most liked content is that made with all age groups in mind.
The insights gained from the above graph will create a positive impact on business.
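The claims above can be checked numerically by averaging ratings within each content-rating group; a sketch on hypothetical data standing in for ps_ar:

```python
import pandas as pd

# Hypothetical mini-frame standing in for ps_ar
df = pd.DataFrame({
    'Content Rating': ['Everyone', 'Teen', 'Everyone', 'Mature 17+'],
    'Rating':         [4.5, 4.0, 3.5, 2.0],
})

# Mean rating per content-rating group, backing up the scatter-plot reading
mean_by_content = df.groupby('Content Rating')['Rating'].mean()
```

A group-level summary like this complements the scatter plot, where overlapping points can hide how many apps sit at each rating.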
# Chart -6 visualization code
#bar plot between last updated and category
f, ax = plt.subplots(figsize=(25,5))
x=ps_ar.loc[:,'Last Updated']
y=ps_ar.loc[:,'Category']
plt.bar(x,y)
plt.title('Category vs Last Updated')
plt.xlabel('Last Updated')
plt.xticks(rotation =90)
plt.ylabel('Category')
plt.show()
# sns.swarmplot(data=ps_ar, x='Last Updated', y='Category')
# plt.show()
To do bivariate analysis between Last Updated and Category.
It is difficult to find insights: there is a huge amount of data and the labels on the x-axis are not properly visible, even after changing the size of the graph.
Are there any insights that lead to negative growth? Justify with specific reason.
N/A
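The x-axis clutter comes from 'Last Updated' being plain strings, so every distinct date gets its own tick. Parsing the column to datetime and aggregating (e.g. per year) makes the chart readable; a sketch on hypothetical samples in the dataset's 'Month D, YYYY' format:

```python
import pandas as pd

# Hypothetical samples in the dataset's 'Month D, YYYY' date format
dates_raw = pd.Series(['January 7, 2018', 'August 1, 2018', 'June 20, 2017'])

# Parse strings into real timestamps, then count updates per year
dates = pd.to_datetime(dates_raw, format='%B %d, %Y')
updates_per_year = dates.dt.year.value_counts()
```

A bar chart of updates_per_year has only a handful of ticks, instead of one per distinct date string.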
# Chart - 7 visualization code
#scatter plot of Genres and Installs
plt.scatter(ps_ar['Genres'], ps_ar['Installs'])
plt.xlabel('Genres')
plt.xticks(rotation=90)
plt.ylabel('Installs')
plt.title('Genres vs Installs')
plt.show()
To do a bivariate analysis of Genres and Installs using a scatter plot.
No significant relation between genres and installs is seen from the plot; it shows a random distribution of points.
Are there any insights that lead to negative growth? Justify with specific reason.
N/A
Chart - 8
# Chart - 8 visualization code
# sns.histplot(x='Current Ver', data=ps_ar)
# plt.xticks(rotation =90)
#OR
plt.hist(ps_ar['Current Ver'])
plt.xlabel('Current Version')
plt.xticks(rotation = 90)
plt.title('Current Version analysis')
plt.show()
To do a univariate analysis of the current version of apps.
Most of the apps have a version higher than 5. A few apps' versions vary with device. Very few apps run on a version lower than 3.
Are there any insights that lead to negative growth? Justify with specific reason.
Apps with a version lower than 4 should start giving frequent updates, so that end users benefit and stay up to date.
Chart - 9
# Chart - 9 visualization code
#scatter plot between Rating and reviews
sns.scatterplot(data=ps_ar, x='Rating',y='Reviews')
plt.title('Rating vs Review')
plt.show()
To see the trend between Rating and Reviews.
Most of the reviews are clustered in the 3.8 to 4.8 rating range.
Are there any insights that lead to negative growth? Justify with specific reason.
This analysis of rating and reviews does not provide any fruitful insights that could have a negative or positive impact on business.
# Correlation Heatmap visualization code
sns.heatmap(ps_ar[['Rating','Reviews']].corr(),cmap='vlag', annot = True)
For correlation analysis between Rating and Reviews, since not many fruitful insights were obtained from their scatter plot.
A positive correlation is seen between rating and reviews.
# Pair Plot visualization code
sns.pairplot(data= ps_ar)
For multivariate analysis.
Ratings are clustered in a specific range of 3.8 to 4.8, meaning most users have provided a positive review and given a good rating.
Few users have given a rating of 5, and few have given below a 2 average rating.
Explain Briefly.
The following points are suggested to the client to achieve the business objective:-
Frequent updates should be provided to customers in this fast-changing world.
Most end users like News & Magazines, Personalization, Productivity, Travel & Local, Family, Medical, Social, Lifestyle, Finance, Business, and Art & Design applications, so we should focus on these segments.
Little scope for scaling up the user base is observed in the Shopping, Photography and Sports segments.
Continuous customer engagement and feedback should be used to improve content quality and increase end-user engagement with the developed application.
A higher number of user reviews shows positive user sentiment for the particular application.
In this project of analysing Google Play Store applications, I have mostly focused on finding the relationship between rating and number of installations, and on what rating to expect given the number of reviews and installations.
I followed the Data Science process of data preparation, data cleansing and data analysis. In data cleansing, I performed a few steps to ensure data quality, such as removing NaN values. With the cleansed data, I performed exploratory data analysis to understand the dataset, for example the number of installations per category.
From the results, we can see that the relationship between genres and installations is very weak, close to no relation at all; the trends in genres and installations are not dependent on each other.
From the results and process implemented, we can conclude that the project objectives have been achieved: analysing the Google Play Store apps and determining trends of the Google Play Store around our focused questions.